Trending: The Promises and the Challenges of Big Social Data


LEV MANOVICH 


Today, the term “big data” is often used in popular media, business, computer
science, and the computer industry. For instance, in June 2008, Wired magazine
opened its special section on “The Petabyte Age” by stating, “Our ability
to capture, warehouse, and understand massive amounts of data is changing science, 
medicine, business, and technology. As our collection of facts and figures grows, so 
will the opportunity to find answers to fundamental questions.” In February 2010, 
The Economist started its special report “Data, Data Everywhere” with the phrase 
“the industrial revolution of data” (coined by computer scientist Joe Hellerstein) 
and then went on to note that “the effect is being felt everywhere, from business to science,
from government to the arts.”

Discussions in popular media usually do not define big data in quantitative
terms. However, in the computer industry, the term has a more precise meaning: 
“Big Data is a term applied to data sets whose size is beyond the ability of commonly 
used software tools to capture, manage, and process the data within a tolerable 
elapsed time. Big data sizes are a constantly moving target currently ranging from 
a few dozen terabytes to many petabytes of data in a single data set” (“Big Data”). 

Since its formation in 2008, the Office of Digital Humanities at the National 
Endowment for the Humanities (NEH) has been systematically creating grant opportunities
to help humanists work with large data sets. The following statement from
a 2011 grant competition organized by the NEH together with a number of other 
research agencies in the United States, Canada, the UK, and the Netherlands pro- 
vides an excellent description of what is at stake: “The idea behind the Digging 
into Data Challenge is to address how ‘big data’ changes the research landscape for 
the humanities and social sciences. Now that we have massive databases of materi- 
als used by scholars in the humanities and social sciences—ranging from digitized 
books, newspapers, and music to transactional data like web searches, sensor data 
or cell phone records—what new, computationally-based research methods might 
we apply? As the world becomes increasingly digital, new techniques will be needed 
to search, analyze, and understand these everyday materials” (“Digging into Data 
Challenge”). The projects funded by the 2009 Digging into Data Challenge and the
earlier NEH 2008 Humanities High Performance Computing Grant Program have 
begun to map the landscape of data-intensive humanities. They include analysis of 
the correspondence of European thinkers between 1500 and 1800; maps, texts, and 
images associated with nineteenth-century railroads in the United States; crimi- 
nal trial accounts (data size: 127 million words); ancient texts; detailed 3-D maps 
of ancient Rome; and the research by my lab to develop tools for the analysis and 
visualization of large image and video data sets. 

At the moment of this writing, the largest data sets being used in digital human- 
ities projects are much smaller than big data used by scientists; in fact, if we use 
industry’s definition, almost none of them qualify as big data (i.e., the work can
be done on desktop computers using standard software, as opposed to supercom- 
puters). But this gap will eventually disappear when humanists start working with 
born-digital user-generated content (such as billions of photos on Flickr), online 
user communication (comments about photos), user-created metadata (tags), and
transaction data (when and from where the photos were uploaded). This web con- 
tent and data is infinitely larger than all already digitized cultural heritage; and, in 
contrast to the fixed number of historical artifacts, it grows constantly. (I expect that 
the number of photos uploaded to Facebook daily is larger than all artifacts stored 
in all the world’s museums.)

In this chapter, I want to address some of the theoretical and practical issues 
raised by the possibility of using massive amounts of such social and cultural data 
in the humanities and social sciences. My observations are based on my own experi- 
ence working since 2007 with large cultural data sets at the Software Studies Initia- 
tive (softwarestudies.com) at the University of California, San Diego (UCSD). The 
issues that I will discuss include the differences between “deep data” about a few 
people and “surface data” about many people, getting access to transactional data, 
and the new “data analysis divide” between data experts and researchers without 
training in computer science. 

The emergence of social media in the middle of the 2000s created opportunities 
to study social and cultural processes and dynamics in new ways. For the first time, 
we can follow imaginations, opinions, ideas, and feelings of hundreds of millions of 
people. We can see the images and the videos they create and comment on, moni- 
tor the conversations they are engaged in, read their blog posts and tweets, navigate 
their maps, listen to their track lists, and follow their trajectories in physical space. 
And we don’t need to ask their permission to do this, since they themselves encour- 
age us to do so by making all of this data public. 

In the twentieth century, the study of the social and the cultural relied on two 
types of data: “surface data” about lots of people and “deep data” about the few individuals
or small groups. The first approach was used in all disciplines that adopted
quantitative methods (i.e., statistical, mathematical, or computational techniques
for analyzing data). The relevant fields include quantitative schools of sociology,
economics, political science, communication studies, and marketing research. 

The second approach was used in humanities fields such as literary studies, art 
history, film studies, and history. It was also used in qualitative schools in psychol- 
ogy (for instance, psychoanalysis and Gestalt psychology), sociology (Wilhelm Dil- 
they, Max Weber, Georg Simmel), anthropology, and ethnography. The examples 
of relevant methods are hermeneutics, participant observation, thick description, 
semiotics, and close reading. 

For example, a quantitative sociologist worked with census data that covered 
most of the country’s citizens. However, this data was collected only every ten years, 
and it represented each individual only on a macro level, leaving out her or his opinions,
feelings, tastes, moods, and motivations (“U.S. Census Bureau”). In contrast,
a psychologist would be engaged with a single patient for years, tracking and inter- 
preting exactly the kind of data that the census did not capture. 

In between these two methodologies of surface data and deep data were sta- 
tistics and the concept of sampling. By carefully choosing her sample, a researcher 
could expand certain types of data about the few into the knowledge about the 
many. For example, starting in the 1950s, the Nielsen Company collected television 
viewing data in a sample of American homes (via diaries and special devices con- 
nected to television sets in twenty-five thousand homes) and then used this sam- 
ple data to predict television ratings for the whole country (i.e., percentages of the 
population which watched particular shows). But the use of samples to learn about 
larger populations had many limitations. 

For instance, in the example of Nielsen’s television ratings, the small sample did
not tell us anything about the actual hour-by-hour, day-to-day patterns of television 
viewing of every individual or every family outside of this sample. Maybe certain 
people watched only news the whole day; others only tuned in to concerts; others 
had the television on but never paid attention to it; still others happened to prefer 
the shows that got very low ratings by the sample group, and so on. The sample 
stats could not tell us anything about this. It was also possible that a particular tele- 
vision program would get zero shares because nobody in the sample audience hap- 
pened to watch it—and in fact, this occurred more than once (“Nielsen Ratings”). 

Imagine that we want to scale up a low-resolution image using a digital image 
editor like Photoshop. For example, we start with a ten-by-ten pixel image (one 
hundred pixels in total) and resize it to one thousand by one thousand (one million 
pixels in total). We do not get any new details—only larger pixels. This is exactly 
what happens when you use a small sample to predict the behavior of a much larger 
population. A “pixel” that originally represented one person comes to represent one 
thousand people who are all assumed to behave in exactly the same way. 
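
To make the analogy concrete, here is a minimal sketch of the simplest kind of enlargement, in which every pixel is just repeated. The function name and the pixel values are illustrative assumptions; image editors such as Photoshop also offer smoother interpolation methods, but those blend existing values and likewise cannot recover detail that was never sampled.

```python
def upscale_nearest(image, factor):
    """Enlarge a 2-D grid by repeating each pixel factor x factor times."""
    return [
        [value for value in row for _ in range(factor)]
        for row in image
        for _ in range(factor)
    ]


# A ten-by-ten "image"; the pixel values are arbitrary stand-ins for sampled data.
small = [[(r * 10 + c) % 7 for c in range(10)] for r in range(10)]

# Resize to one thousand by one thousand (one million pixels in total).
large = upscale_nearest(small, 100)

distinct_small = {v for row in small for v in row}
distinct_large = {v for row in large for v in row}

print(len(large), len(large[0]))         # 1000 1000
print(distinct_small == distinct_large)  # True: no new information appeared
```

Each original value now fills a block of one hundred by one hundred identical pixels, which is the visual counterpart of assuming that everyone represented by one sampled person behaves identically.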

The rise of social media, along with new computational tools that can process 
massive amounts of data, makes possible a fundamentally new approach to the 
study of human beings and society. We no longer have to choose between data size 
and data depth. We can study exact trajectories formed by billions of cultural expressions,
experiences, texts, and links. The detailed knowledge and insights that before
could only be reached about a few people can now be reached about many more 
people. In 2007, Bruno Latour summarized these developments as follows: “The 
precise forces that mould our subjectivities and the precise characters that furnish 
our imaginations are all open to inquiries by the social sciences. It is as if the inner 
workings of private worlds have been pried open because their inputs and outputs 
have become thoroughly traceable” (Latour).

Two years earlier, in 2005, PhD student Nathan Eagle at the MIT Media Lab was 
already thinking along similar lines. He and his advisor Alex Pentland put up
a website called Reality Mining (“MIT Media Lab: Reality Mining”) and described 
how the new possibilities of capturing details of people’s daily behavior and communication
via mobile phones could create a sociology for the twenty-first century
(“Sociology in the 21st Century”). To put this idea into practice, they distributed 
Nokia phones with special software to one hundred MIT students who then used 
these phones for nine months, which generated approximately sixty years of “con- 
tinuous data on daily human behavior.” Eagle and Pentland published a number of 
articles based on the analysis of data they collected. Today, many more computer 
scientists are working with large social data sets; they call their new field “social 
computing.” According to the definition provided by the website of the Third IEEE 
International Conference on Social Computing (2011), social computing refers to 
“computational facilitation of social studies and human social dynamics as well 
as design and use of information and communication technologies that consider 
social context” (“Social Computing”). 

Now let us consider Google search. Google’s algorithms analyze billions of 
web pages, plus PDF, Word documents, Excel spreadsheets, Flash files, plain text 
files, and, since 2009, Facebook and Twitter content. Currently, Google does not
offer any service that would allow a user to analyze patterns directly in all of this 
data the way Google Insights for Search does with search queries and Google’s 
Ngram Viewer does with digitized books, but it is certainly technologically con- 
ceivable. Imagine being able to study the collective intellectual space of the whole 
planet, seeing how ideas emerge, diffuse, burst, and die and how they get linked 
together and so on—across the data set estimated to contain at least 14.55 billion 
pages (de Kunder). 

To quote Wired's “The Petabyte Age” issue again, “In the era of big data, more 
isn’t just more. More is different.” 

Does all this sound exciting? It certainly does. So what might be wrong with 
these arguments? Are we indeed witnessing the collapse of the deep-data/surface- 
data divide? If so, could this collapse open a new era for social and cultural research? 

I am going to discuss four objections to the optimistic vision I have just pre- 
sented. These objections do not imply that we should not use new data sources about 
human culture and human social life or not analyze them with computational tools. 
I strongly believe that we should do this. But we need to carefully understand what
is possible in practice as opposed to in principle. We also need to be clear about 
what skills digital humanists need to take advantage of the new scale of human data. 

1. Only social media companies have access to really large social data sets, espe- 
cially transactional data. An anthropologist working for Facebook or a sociologist 
working for Google will have access to data that the rest of the scholarly commu- 
nity will not. 

A researcher can obtain some of this data through APIs provided by most social 
media services and the largest online media retailers (YouTube, Flickr, Amazon,
etc.). An API (Application Programming Interface) is a set of commands that can 
be used by a user program to retrieve the data stored in a company’s databases. For 
example, the Flickr API can be used to download all photos in a particular group and 
also to retrieve information about the size of each photo, available comments, geolo- 
cation, the list of people who favored this photo, and so on (“Flickr API Methods”). 
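
As a rough illustration of what such a “command” looks like in practice, the sketch below builds the web address for one request that lists photos in a group. The endpoint address, the method name flickr.groups.pools.getPhotos, and the parameter names follow Flickr's public documentation as I understand it and should be checked against the current version; the key and group identifier are placeholders, and nothing is actually sent over the network.

```python
from urllib.parse import urlencode

# Assumed REST endpoint; verify against the current Flickr API documentation.
FLICKR_REST_ENDPOINT = "https://api.flickr.com/services/rest/"


def build_group_photos_url(api_key, group_id, page=1, per_page=100):
    """Build the request URL for one page of photos from a Flickr group pool."""
    params = {
        "method": "flickr.groups.pools.getPhotos",  # assumed method name
        "api_key": api_key,
        "group_id": group_id,
        "page": page,
        "per_page": per_page,
        "format": "json",
        "nojsoncallback": 1,
    }
    return FLICKR_REST_ENDPOINT + "?" + urlencode(params)


# Placeholder credentials and group id, for illustration only.
url = build_group_photos_url("YOUR_API_KEY", "GROUP_ID_PLACEHOLDER", page=2)
print(url)
```

A program that collects a whole group would request page after page in this way and then issue further requests for the comments, geolocation, and favorites of each photo.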

The public APIs provided by social media and social network companies do 
not give all data that these companies themselves are capturing about the users. Still, 
you can certainly do very interesting new cultural and social research by collecting 
data via APIs and then analyzing it—if you are good at programming, statistics, 
and other data analysis methods. (In my lab, we have recently used the Flickr API to
download one hundred and sixty-seven thousand images from the “Art Now” Flickr 
group, and we are currently working to analyze these images to create a “map” of 
“user-generated art.”) 

Although APIs themselves are not complicated, all truly large-scale research
projects that use data obtained through these APIs so far have been undertaken by researchers
in computer science. A good way to follow the work in this area is to look at papers 
presented at yearly World Wide Web conferences (WWW2009 and WWW2010). 
Recent papers have investigated how information spreads on Twitter (data: 100 
million tweets, Kwak, Lee, Park, and Moon), what qualities are shared by the most 
favored photos on Flickr (data: 2.2 million photos), how geotagged Flickr photos are 
distributed spatially (data: 35 million photos, Crandall, Backstrom, Huttenlocher, 
and Kleinberg), and how user-generated videos on YouTube compare with similar 
videos on Daum, the most popular UGC (user-generated content) service in Korea 
(data: 2.1 million videos, Cha, Kwak, Rodriguez, Ahn, and Moon). 

It is worth pointing out that even researchers working inside the largest social 
media companies can’t simply access all the data collected by different services in 
a company. Some time ago, I went to a talk by a researcher from Sprint, one of the 
largest U.S. phone companies, who was analyzing the relations between geographic 
addresses of phone users and how frequently they called other people. He did have 
access to this data for all Sprint customers (around fifty million). However, when he 
was asked why he did not use other data collected by Sprint, such as instant messages 
and apps usage, he explained that these services are operated by a different part of 
the company and that the law prohibits employees from having access to all of this data
together. He pointed out that like any other company, Sprint does not want to get
into lawsuits for breach of privacy and pay huge fines and damage their brand image; 
therefore they are being very careful in terms of who gets to look at what data. You 
don’t have to believe this, but I do. For example, do you think Google enjoys all the 
lawsuits about Street View? If you were running a business, would you risk losing 
hundreds of millions of dollars and badly damaging your company image? 

2. We need to be careful of reading communications over social networks and 
digital footprints as “authentic.” People’s posts, tweets, uploaded photographs, comments,
and other types of online participation are not transparent windows into
their selves; instead, they are usually carefully curated and systematically managed 
(Ellison, Heino, and Gibbs). 

Imagine that you wanted to study the cultural imagination of people in Russia 
in the second half of the 1930s and you only looked at newspapers, books, films, and
other cultural texts, which of course all went through government censors before 
being approved for publication. You would conclude that indeed everybody in Rus- 
sia loved Lenin and Stalin, was very happy, and was ready to sacrifice his or her life 
to build communism. You may say that this is an unfair comparison and that it 
would be more appropriate to look instead at people’s diaries. Yes, indeed it would 
be better; however, if you were living in Russia in that period and you knew that any 
night a black car might stop in front of your house and you would be taken away 
and probably shot soon thereafter, would you really commit all your true thoughts 
about Stalin and the government to paper? In 1933, the famous Russian poet Osip Mandelstam
wrote a short poem that criticized Stalin only indirectly without even naming
him, and he paid for this with his life (“Stalin Epigram”).

Today, if you live in a pretty large part of the world, you know that the gov- 
ernment is likely to scan your electronic communications systematically (“Internet 
Censorship by country”). In some countries, citizens may be arrested simply for 
visiting a wrong website. In these countries, you will be careful about what you are 
saying online. Some of us live in other countries where a statement against the gov- 
ernment does not automatically put you in prison, and therefore people feel they 
can be more open. In other words, it does not matter if the government is track- 
ing us or not; what is important is what it can do with this information. (I grew up 
in the Soviet Union in the 1970s and then moved to the United States; based on 
my experience living in both societies, in this respect the difference between them 
is very big. In the USSR, we never made any political jokes on the phone and only 
discussed politics with close friends at home.) 

Now let us assume that we are living in a country where we are highly unlikely 
to be prosecuted for occasional antigovernment remarks. But still, how authentic
are all the rest of our online expressions? As Erving Goffman and other sociologists
pointed out a long time ago, people always construct their public presence,
carefully shaping how they present themselves to others, and social media is certainly
not an exception to this (“Erving Goffman”), though the degree of this public
self-construction varies. For instance, most of us tend to do less self-censorship and
editing on Facebook than in the profiles on dating sites or in a job interview. Others 
carefully curate their profile pictures to construct an image they want to project. (If
you scan your friends’ Facebook profile pictures, you are likely to find a wide range.)
But just as we do in all other areas of our everyday lives, we exercise some control 
all the time when we are online—what we say, what we upload, what we show as 
our interests, and so on. 

Again, this does not mean that we can’t do interesting research by analyzing 
larger numbers of tweets, Facebook photos, YouTube videos, or any other social 
media site—we just have to keep in mind that all these data are not a transparent 
window into people’s imaginations, intentions, motives, opinions, and ideas. It’s more
appropriate to think of it as an interface people present to the world—that is, a par- 
ticular view that shows only some parts of their actual lives and imaginations and 
may also include other fictional data designed to project a particular image (Elli- 
son, Heino, and Gibbs). 

3. Is it really true that “we no longer have to choose between data size and data 
depth,” as I stated? Yes and no. Imagine this hypothetical scenario. On the one side, 
we have ethnographers who are spending years inside a particular community. On
the other side, we have computer scientists who never meet people in this community
but have access to their social media and digital footprints—daily spatial trajectories
captured with GPS and all video recorded by surveillance cameras, online
and offline conversations, uploaded media, comments, likes, and so on. According 
to my earlier argument, both parties have “deep data,” but the advantage of a com- 
puter science team is that they can capture this data about hundreds of millions of 
people as opposed to only a small community.

How plausible is this argument? For thousands of years, we would learn about 
other people exclusively through personal interactions. Later, letter writing became 
an important new mechanism for building personal (especially romantic) relation- 
ships. In the twenty-first century, we can have access to a whole new set of machine- 
captured traces and records of individuals’ activities. Given that this situation is very 
new, it is to be expected that some people will find it hard to accept the concept that 
such machine records can be as meaningful in helping us to understand commu- 
nities and individuals as face-to-face interaction. They will argue that whatever the 
quality of the data sources, data analysis ideas, and algorithms used by computer 
scientists, they will never arrive at the same insights and understanding of people 
and dynamics in the community as ethnographers. They will say that even the most 
comprehensive social data about people which can be automatically captured via 
cameras, sensors, computer devices (phones, game consoles), and web servers can’t 
be used to arrive at the same “deep” knowledge. 

It is possible to defend both positions, but what if both are incorrect? I believe 
that in our hypothetical scenario, ethnographers and computer scientists have 
access to different kinds of data. Therefore they are likely to ask different questions,
notice different patterns, and arrive at different insights. 

This does not mean that the new computer-captured “deep surface” of data is 
less “deep” than the data obtained through long-term personal contact. In terms of 
the sheer number of “data points,” it is likely to be much deeper. However, many of 
these data points are quite different from the data points available to ethnographers.

For instance, if you are physically present in some situation, you may notice 
some things that you would not notice if you were watching a high-res video of the 
same situation. But at the same time, if you do computer analysis of this video you 
may find patterns you would not notice if you were on the scene only physically. 
Of course, people keep coming up with new techniques that combine on-the-scene 
physical presence and computer and network-assisted techniques. For a good exam- 
ple of such innovation, see the Valley of the Khans project at UCSD. In this proj- 
ect, photos captured by small unmanned aerial vehicles sent out by an archeologi- 
cal team moving around a large area in Mongolia are immediately uploaded to a 
special National Geographic site, exploration.nationalgeographic.com. Thousands 
of people immediately start tagging these photos for interesting details, which tells 
archeologists what to look for on the ground (“Help Find Genghis Khan’s Tomb”). 

The questions of what can be discovered and understood with computer anal- 
ysis of social and cultural data versus traditional qualitative methods are particu- 
larly important for the digital humanities. My hypothetical example used data about 
social behavior, but the “data” can also be eighteenth-century letters of European 
thinkers, nineteenth-century maps and texts about railroads, hundreds of thou- 
sands of images uploaded by users to a Flickr group, or any other set of cultural 
artifacts. When we start reading these artifacts with computers, humanists become 
very nervous. 

I often experience this reaction when I lecture about digital humanities research 
done in my lab, the Software Studies Initiative at UCSD (softwarestudies.com). The lab
focuses on the development of methods and tools for the exploration and study
of massive cultural visual data, both digitized visual artifacts and contemporary
visual and interactive media (“Software Studies: Cultural Analytics”). We use digi- 
tal image analysis and new visualization techniques to explore cultural patterns in 
large sets of images and video: user-generated video, visual art, magazine covers and 
pages, graphic design, photographs, feature films, cartoons, and motion graphics. 
Examples of the visual data sets we have analyzed include 20,000 pages of Science 
and Popular Science magazine issues published between 1872 and 1922, 780 paintings
by Van Gogh, 4,535 covers of Time magazine (1923-2009), and 1,074,790 manga pages
(“One Million Manga Pages”). 

In our experience, almost every time we analyze and then visualize a new image 
or video collection, or even a single time-based media artifact (a music video, a 
feature film, a video recording of a game play), we find some surprising new pat- 
terns. This applies equally to collections of visual artifacts about which we had few 
a priori assumptions (for instance, one hundred and sixty-seven thousand images
uploaded by users to the “Art Now” Flickr group) and artifacts that already were studied in
detail by multiple authors. 

As an example of the latter, I will discuss a visualization of the film The Eleventh
Year by the famous twentieth-century Russian director Dziga Vertov (Manovich,
“Visualizing Large Image Collections for Humanities Research”). The visualization 
itself can be downloaded from our Flickr account (Manovich, “Motion Studies: 
Vertov’s The Eleventh Year”). 

My sources were the digitized copy of the film provided by the Austrian Film
Museum and the information about all shot boundaries created manually by a 
museum researcher. (With other moving image sources, we use the open source 
software Shotdetect, which automatically detects most shot boundaries in a typical 
film.) The visualization uses only the first and last frame of every shot in the film, 
disregarding all other frames. Each shot is represented as a column: the first frame 
is on the top, and the last frame is right below. 
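
The layout just described can be sketched in a few lines. This is only an illustration under stated assumptions, not the lab's actual software: the function names, the shot boundaries, and the thumbnail size are all hypothetical, and the sketch computes where each frame would be placed rather than drawing any images.

```python
def shot_frame_pairs(shot_starts, total_frames):
    """Return (first_frame, last_frame) for every shot.

    shot_starts: sorted frame indices at which each shot begins.
    total_frames: number of frames in the whole film.
    """
    pairs = []
    for i, start in enumerate(shot_starts):
        if i + 1 < len(shot_starts):
            end = shot_starts[i + 1] - 1  # frame just before the next shot
        else:
            end = total_frames - 1        # the last shot runs to the end
        pairs.append((start, end))
    return pairs


def column_layout(pairs, thumb_w, thumb_h):
    """Place each shot as a column: first frame on top, last frame right below."""
    placements = []
    for col, (first, last) in enumerate(pairs):
        x = col * thumb_w
        placements.append({"frame": first, "x": x, "y": 0})
        placements.append({"frame": last, "x": x, "y": thumb_h})
    return placements


# Hypothetical shot boundaries for a 200-frame clip.
starts = [0, 48, 130, 131]
pairs = shot_frame_pairs(starts, total_frames=200)
layout = column_layout(pairs, thumb_w=80, thumb_h=60)

print(pairs)  # [(0, 47), (48, 129), (130, 130), (131, 199)]
```

For a full-length film the columns would have to wrap into several rows, but the principle stays the same: comparing the top and bottom image of each column shows at a glance how much a shot changes between its start and its end.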

“Vertov” is a neologism invented by the director, who adopted it as his last
name early in his career. It comes from the Russian verb vertet, which means “to 
rotate something.” “Vertov” may refer to the basic motion involved in filming in the 
1920s—rotating the handle of a camera—and also the dynamism of film language 
developed by Vertov who, along with a number of other Russian and European artists,
designers, and photographers working in that decade, wanted to defamiliarize
reality by using dynamic diagonal compositions and shooting from unusual
points of view. However, my visualization suggests a very different picture of Vertov. 
Almost every shot of The Eleventh Year starts and ends with practically the same 
composition and subject. In other words, the shots are largely static. Going back 
to the actual film and studying these shots further, we find that some of them are 
indeed completely static—such as the close-ups of people’s faces looking in various 
directions without moving. Other shots employ a static camera that frames some 
movement—such as working machines, or workers at work—but the movement is 
localized completely inside the frame (in other words, the objects and human figures
do not cross the view framed by the camera). Of course, we may recall that a
number of shots in Vertov’s most famous film, Man with a Movie Camera (1929), 
were designed specifically as opposites: shooting from a moving car meant that the 
subjects were constantly crossing the camera view. But even in this most experimental
of Vertov’s films, such shots constitute a very small part of the film.

One of the typical responses to my lectures is that computers can’t lead to the 
same nuanced interpretation as traditional humanities methods and that they can’t 
help understand deep meanings of artworks. My response is that we don’t want to 
replace human experts with computers. As I will describe in the hypothetical sce- 
nario of working with one million YouTube documentary-style videos, we can use 
computers to quickly explore massive visual data sets and then select the objects for 
closer manual analysis. While computer-assisted examination of massive cultural 
data sets typically reveals new patterns in this data that even the best manual “close
reading” would miss—and of course, even an army of humanists will not be able 
to carefully close read massive data sets in the first place—a human is still needed 
to make sense of these patterns. 

Ultimately, completely automatic analysis of social and cultural data will not 
produce meaningful results today because computers still have a limited ability to 
understand the context of texts, images, video, and other media. (Recall the mistakes 
made by the IBM Watson artificial intelligence computer when it competed on the 
television quiz show Jeopardy! in early 2011 [“Watson (computer)”].) 

Ideally, we want to combine the human ability to understand and interpret— 
which computers can’t completely match yet—and the computer’s ability to analyze 
massive data sets using algorithms we create. Let us imagine the following research 
scenario. You want to study documentary-type YouTube videos created by users in 
country X during the period Y, and you were able to determine that the relevant data 
set contains one million videos. So what do you do next? Computational analysis 
would be perfect as the next step to map the overall “data landscape”: identify the 
most typical and most unique videos; automatically cluster all videos into a number 
of categories; find all videos that follow the same strategies, and so on. At the end of 
this analytical stage, you may be able to reduce the set of one million videos to one 
hundred videos, which represent it in a more comprehensive way than if you sim- 
ply used a standard sampling procedure. For instance, your reduced set may contain 
both the most typical and the most unique videos in various categories. Now that 
you have a manageable number of videos, you can actually start watching them. If 
you find some video to be particularly interesting, you can then ask the computer 
to fetch more videos that have similar characteristics, so you can look at all of them. 
At any point in the analysis, you can go back and forth between particular videos, 
groups of videos, and the whole collection of one million videos, experimenting 
with new categories and groupings. And just as Google Analytics allows you to select 
any subset of data about your website and look at its patterns over time (number of 
viewed pages) and space (where do visitors come from), you will be able to select 
any subset of the videos and see various patterns across this subset.
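
One very simple way to find “the most typical and most unique” items in the scenario above is to describe each video by a few numbers and measure how far each one lies from the average of the collection. The sketch below assumes this approach purely for illustration; the video names and feature values are invented, and a real project would use many more features and a proper clustering method.

```python
import math


def centroid(vectors):
    """Average of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def rank_by_typicality(features):
    """Return item ids ordered from most typical to most unusual.

    Typicality is assumed here to mean a small Euclidean distance
    from the centroid of the whole collection.
    """
    ids = list(features)
    center = centroid([features[i] for i in ids])
    return sorted(ids, key=lambda i: math.dist(features[i], center))


# Hypothetical features per video, e.g. (average brightness, cutting rate).
features = {
    "video_a": [0.50, 0.40],
    "video_b": [0.55, 0.45],
    "video_c": [0.45, 0.35],
    "video_d": [0.95, 0.05],
}

ranked = rank_by_typicality(features)
print(ranked[0], ranked[-1])  # video_a video_d
```

The items at the two ends of such a ranking are natural candidates for the reduced set that a researcher then watches and interprets in full.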

This is my vision of how we can study large cultural data sets, whether these are 
billions of videos on YouTube or billions of photos on Flickr, or smaller samples of 
semiprofessional or professional creative productions such as one hundred million 
images on deviantart.com or two hundred and fifty thousand design portfolios on 
coroflot.com. Since 2007, our lab has been gradually working on visualization tech- 
niques that would enable such research exploration. 

4. Imagine that you have software that combines large-scale automatic data 
analysis and interactive visualization. (We are gradually working to integrate 
various tools that we designed in our lab to create such a system. See “Cultural 
Analytics Research Environment.”) If you also have skills to examine individual 
artifacts and the openness to ask new questions, the software will help you to take 
research in many new exciting directions. However, there are also many kinds of 
interesting questions that require expertise in computer science, statistics, and data 
mining—something that social and humanities researchers typically don’t have. 
This is another serious objection to the optimistic view of big data-driven humani- 
ties and social research I presented earlier. 

The explosion of data and the emergence of computational data analysis as the 
key scientific and economic approach in contemporary societies create new kinds 
of divisions. Specifically, people and organizations are divided into three categories: 
those who create data (both consciously and by leaving digital footprints), those 
who have the means to collect it, and those who have the expertise to analyze it. 
The first group includes pretty much everybody in the world who is using the web 
or mobile phones; the second group is smaller; and the third group is much smaller 
still. We can refer to these three groups as the new “data classes” of our “big-data 
society” (my neologisms). 

At Google, computer scientists are working on the algorithms that scan a web 
page a user is on currently and select which ads to display. At YouTube, computer 
scientists work on algorithms that automatically show a list of other videos deemed 
to be relevant to one you are currently watching. At BlogPulse, computer scientists 
work on algorithms that allow companies to use sentiment analysis to study the 
feelings that millions of people express about their products in blog posts. At cer- 
tain Hollywood movie studios, computer scientists work on algorithms that predict 
the popularity of forthcoming movies by analyzing tweets about them (it works). 
In each case, the data and the algorithms can also reveal really interesting things 
about human cultural behavior in general, but this is not what the companies who 
are employing these computer scientists are interested in. Instead, the analytics are 
used for specific business ends. For more examples, see “What People Want (and 
How to Predict it).” 

So what about the rest of us? Today we are given a variety of sophisticated 
and free software tools to select the content of interest to us from this massive and 
constantly expanding universe of professional media offerings and user-generated 
media. These tools include search engines, RSS feeds, and recommendation sys- 
tems. But while they can help you find what to read, view, listen to, play, remix, 
share, comment on, and contribute to, in general they are not designed for 
carrying out systematic social and cultural research along the lines of the “cultural analytics” 
scenario I described earlier. 

While a number of free data analysis and visualization tools have become 
available on the web during the last few years (Many Eyes, Tableau, Google Docs, etc.), they 
are not useful unless you have access to large social data sets. Some commercial web 
tools allow anybody to analyze certain kinds of trends in certain data sets they are 
coupled with in some limited ways (or, at least, they whet our appetites by 
showing what is possible). I am thinking of the already mentioned Google Books Ngram 
Viewer, Trends, Insights for Search, BlogPulse, and also YouTube Trends Dashboard, 
Social Radar, and Klout. (Searching for “social media analytics” or “Twitter 
analytics” brings up lists of dozens of other tools.) 

For example, the Google Ngram Viewer plots relative frequencies of words or 
phrases you input across a few million English language books published over the 
last four hundred years and digitized by Google. (Data sets in other languages are 
also available.) You can use it to reveal all kinds of interesting cultural patterns. Here 
are some of my favorite combinations of words and phrases to use as input: “data, 
knowledge”; “engineer, designer”; “industrial design, graphic design.” In another 
example, YouTube Trends Dashboard allows you to compare most-viewed videos 
across different geographic locations and age groups. 
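
The idea of a relative frequency plot can be made concrete with a minimal Python sketch. It assumes a hypothetical corpus given as (year, text) pairs and a deliberately naive tokenizer; it illustrates the general principle of dividing a word's count by the total number of words for that year, and makes no claim about how Google's own tool is implemented.

```python
import re
from collections import Counter

def tokenize(text):
    """Naive tokenizer: lowercase runs of the letters a-z."""
    return re.findall(r"[a-z]+", text.lower())

def relative_frequency(corpus, word):
    """Share of all tokens in each year that are equal to `word`.

    `corpus` is an iterable of (year, text) pairs, a hypothetical layout.
    """
    totals = Counter()
    hits = Counter()
    target = word.lower()
    for year, text in corpus:
        tokens = tokenize(text)
        totals[year] += len(tokens)
        hits[year] += tokens.count(target)
    return {year: hits[year] / totals[year] for year in sorted(totals) if totals[year]}

# Toy corpus, invented for illustration only.
corpus = [
    (1900, "Knowledge is power and knowledge grows"),
    (2000, "Data beats knowledge when data is big data"),
]
for term in ("data", "knowledge"):
    print(term, relative_frequency(corpus, term))
```

Dividing by the yearly total is what makes years comparable: without it, any word would appear to rise simply because more books were published.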

Still, what you can do with these tools today is quite limited. One of the rea- 
sons for this is that companies make money by analyzing patterns in the data they 
collect about our online and physical behavior and target their offerings, ads, sales 
events, and promotions accordingly; in other cases, they sell this data to other com- 
panies. Therefore they don’t want to give consumers direct access to all of their data. 
(According to an estimate by ComScore, at the end of 2007, five large web 
companies were recording “at least 336 billion transmission events in a month” [“To Aim 
Ads, Web Is Keeping Closer Eye on You”].) 

If a consumer wants to analyze patterns in the data that constitutes or reflects 
her or his economic relations with a company, the situation is different. The com- 
panies often provide the consumers with professional-level analysis of this data— 
financial activities (e.g., my bank website shows a detailed breakdown of my spend- 
ing categories), websites and blogs (Google Analytics), or online ad campaigns 
(Google AdWords). 

Another relevant trend is to let a user compare her or his data against the sta- 
tistical summaries of data about others. For instance, Google Analytics shows the 
performance of my website against all websites of similar type, while many fitness 
devices and sites allow you to compare your performance against the summarized 
performance of other users. However, in each case, the companies do not open the 
actual data but only provide the summaries. 

Outside of the commercial sphere, we do see a gradual opening up of the data 
collected by government agencies. For U.S. examples, check Data.gov, 
HealthData.gov, and Radar.Oreilly.com. As Alex Howard notes in “Making Open 
Government Data Visualizations That Matter,” “every month, more open government 
data is available online. Local governments are becoming data suppliers.” Note, 
however, that these data are typically statistical summaries as opposed to trans- 
actional data (the traces of online behavior) or their media collected by social 
media companies. 

The limited access to massive amounts of transactional social data that is being 
collected by companies is one of the reasons why today large contemporary data- 
driven social science and large contemporary data-driven humanities are not easy to 
do in practice. (For examples of digitized cultural archives available at the moment, 
see the list of repositories [“List of Data Repositories”] that agreed to make their 
data available to Digging into Data competitors.) Another key reason is the large gap 
between what can be done with the right software tools, right data, and no knowl- 
edge of computer science and advanced statistics and what can only be done if you 
do have this knowledge. 

For example, imagine that you were given full access to the digitized books used 
in Ngram Viewer (or maybe you created your own large data set by assembling texts 
from Project Gutenberg or another source) and you want software to construct 
graphs that show changing frequencies of topics over time, as opposed to 
individual words. If you want to do this, you had better have knowledge of computational 
linguistics or text mining. (A search for “topic analysis” on Google Scholar returned 
239,000 articles for the first field and 39,000 articles for the second newer field.) 

Or imagine that you were interested in how social media facilitates informa- 
tion diffusion, and you want to use Twitter data for your study. In this case, you 
can obtain the data using Twitter APIs or third-party services that collect this data 
and make it available for free or for a fee. But again, you must have the right back- 
ground to make use of this data. The software itself is free and readily available—R, 
Weka, Gate, Mallet, and so on—but you need the right training (at least some classes 
in computer science and statistics) and prior practical experience to get meaning- 
ful results. 
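
To give a sense of what such a study involves at its simplest, here is a hedged Python sketch of one basic diffusion measure: the size and depth of each sharing cascade. The record layout, pairs of (tweet_id, parent_id) in which parent_id is None for an original post, is a hypothetical simplification and not the format of any Twitter API; the sketch also assumes the records form trees without cycles.

```python
from collections import defaultdict

def cascades(tweets):
    """Size and depth of every cascade.

    `tweets` is an iterable of (tweet_id, parent_id) pairs; parent_id is
    None for an original post (hypothetical layout, assumed cycle-free).
    """
    children = defaultdict(list)
    roots = []
    for tweet_id, parent_id in tweets:
        if parent_id is None:
            roots.append(tweet_id)
        else:
            children[parent_id].append(tweet_id)

    result = {}
    for root in roots:
        size, depth = 0, 0
        stack = [(root, 0)]  # walk the tree without recursion
        while stack:
            node, level = stack.pop()
            size += 1
            depth = max(depth, level)
            stack.extend((child, level + 1) for child in children[node])
        result[root] = {"size": size, "depth": depth}
    return result

# Toy records, invented for illustration only.
tweets = [("t1", None), ("t2", "t1"), ("t3", "t1"), ("t4", "t2"), ("t5", None)]
print(cascades(tweets))
```

Counting is the easy part. Deciding what a cascade means, how to sample it, and how to test a claim about it is where the training in statistics and computer science comes in.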

Here is an example of what can be done by people with the right background. 
In 2010, four researchers from the Computer Science Department at KAIST (South 
Korea’s leading university for technology) published a paper titled “What Is Twitter, 
a Social Network or a News Media?” Using Twitter APIs, they were able to study the 
entire Twittersphere as of 2009: 41.7 million user profiles, 1.47 billion social rela- 
tions, 106 million tweets. Among their discoveries, over 85 percent of trending top- 
ics are “headline news or persistent news in nature.” (Note that the lead author on 
the paper was a PhD student. It is also relevant to note that the authors make their 
complete collected data sets freely available for download, so that they can be used 
by other researchers. For more examples of the analysis of “social flows,” see papers 
presented at the IEEE International Conference on Social Computing 2010.) 

In this chapter, I have sketched an optimistic vision of a new paradigm opened 
to humanities and social sciences. I have then discussed four objections to this opti- 
mistic vision. There are other equally important objections that I did not discuss 
because they are already debated in popular media and in academia by many peo- 
ple. For example, a very big issue is privacy; would you trust academic researchers 
to have all your communication and behavior data automatically captured? 

So what conclusions should we draw from this analysis? Is it true that “surface 
is the new depth,” in a sense that the quantities of “deep” data that in the past were 
obtainable about a few can now be automatically obtained about many? Theoreti- 
cally, the answer is yes, as long as we keep in mind that the two kinds of deep data 
have different content. 
Practically, there are a number of obstacles before this can become a reality. I 
have tried to describe a few of these obstacles, but there are also others I did not 
analyze. However, with what we already can use today (social media companies’ APIs, 
the Infochimps.com data marketplace and data commons, free archives such as 
Project Gutenberg, Internet Archive, etc.), the possibilities are endless, if you know 
some programming and data analytics and also are open to asking new types of 
questions about human beings, their social lives, their cultural expressions, and 
their experiences. 

I have no doubt that eventually we will see many more humanities and social 
science researchers who will be as good at implementing the latest data 
analysis algorithms themselves, without relying on computer scientists, as they are 
at formulating abstract theoretical arguments. However, this requires a big change 
in how students in humanities are being educated. 

The model of big-data humanities research that exists now is that of collabora- 
tion between humanists and computer scientists. It is the right way to start “digging 
into data.” However, if each data-intensive project done in the humanities has 
to be supported by a research grant that allows such collaboration, 
our progress will be very slow. We want humanists to be able to use data analysis 
and visualization software in their daily work, so they can combine quantitative and 
qualitative approaches in all their work. How to make this happen is one of the key 
questions for the digital humanities. 


NOTES 


1. I am grateful to UCSD faculty member James Fowler for an inspiring conversation 
a few years ago about the collapse of depth/surface distinction. See his work at 
jhfowler.ucsd.edu. 


2. More details at http://en.wikipedia.org/wiki/Google_Search. 